WatchDog AI

Authors

Jorge Bris Moreno, Brian Kwon, and Thomas Sigall

Published

April 28, 2025

Introduction

Voter information is a crucial part of democratic processes. For this reason, groups like the one led by Dr. Thessalia Merivaki and Dr. Mara Suttmann-Lea have been studying how this information is communicated. By tracking the social media content posted by local and state election officials, they have been able to study the topics, types of information, and styles these officials use. They have also been able to verify the reliability of the information posted and make actionable recommendations on the officials’ communication strategies. However, because of the volume of content they share and the limited resources available to them, officials have been relying more and more on AI-generated content for their social media posts. While the content may be accurate, some posters contain misspellings and other images do not look realistic enough. Dr. Merivaki and Dr. Suttmann-Lea therefore reached out to us to help detect these AI-generated images so that they can later study whether the images cause mistrust among the public.

To do so, we created WatchDog AI, a pipeline that aims to detect “harmful AI images”. We define a harmful AI image as any image that makes the viewer doubt its provenance (that is, suspect it is AI generated) and can thus cause mistrust. Note that if an image is so well made that it looks real, we do not count it as “harmful AI”: AI can be an effective tool for election officials, giving them the means to run effective social media campaigns with limited resources.

Based on the data we want to classify, we divided images into two categories: posters and realistic images. We count as harmful AI images posters that contain misspellings and realistic images that do not look “real” or raise doubts in the viewer. Our proposed pipeline is designed to account for these cases.

Pipeline

The pipeline below is WatchDog AI. Fail means an image is flagged as AI, and Pass means it is not. The pipeline accounts for the scenarios encountered in our Election Officials Dataset. Because our data is not labeled, however, the models were trained on other datasets; later in the report you can see our eye test on our own data. It is also worth noting that this pipeline is intended to have a human in the loop who reviews the flagged images to study their effect on viewers’ trust.
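
The routing between stages can be sketched in a few lines of Python. This is a minimal sketch, not the project code: the stage order (poster check, then the poster or non-poster detector, then the object detector as a last check on non-posters) is one reading of the description in this report, and the function name, the scorer callables, and the 0.5 cut-off are all hypothetical.

```python
from typing import Callable

# Each scorer is assumed to return a score in [0, 1] for an image.
# All names and the 0.5 cut-off are hypothetical placeholders.
Scorer = Callable[[object], float]


def watchdog_route(
    image: object,
    is_poster: Scorer,
    harmful_poster: Scorer,
    ai_non_poster: Scorer,
    has_artifacts: Callable[[object], bool],
    threshold: float = 0.5,
) -> str:
    """Return "Fail" (flag for human review) or "Pass"."""
    if is_poster(image) >= threshold:
        # Poster branch: the harmful-AI-poster classifier decides.
        return "Fail" if harmful_poster(image) >= threshold else "Pass"
    # Non-poster branch: the general AI detector goes first.
    if ai_non_poster(image) >= threshold:
        return "Fail"
    # Last check: look for small artifacts the classifier may have missed.
    return "Fail" if has_artifacts(image) else "Pass"


# A realistic-looking non-poster that only the object detector catches.
result = watchdog_route(
    "example.jpg",
    is_poster=lambda _: 0.1,
    harmful_poster=lambda _: 0.9,
    ai_non_poster=lambda _: 0.2,
    has_artifacts=lambda _: True,
)
```

Every "Fail" in this sketch is only a flag: the image would still go to a human reviewer.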

Poster Classification

The first stage of our pipeline handles images that are posters, a common method of communication for election officials. It is split into two components: one model to classify whether the image is a poster, and another to classify whether the image is a harmful AI poster.

Poster/Non-Poster Classification

Data for the initial poster classification model was collected from Google’s Open Images Dataset (https://storage.googleapis.com/openimages/web/index.html). This dataset contains around 9,000,000 images with 600+ labels. For this initial poster classification model, we took 5,000 images from the dataset that were labeled as posters and 5,000 images across all other labels.
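
A balanced sample of this kind can be drawn as follows. This is a minimal sketch under simplifying assumptions: each image is represented by one (image_id, label) pair, the positive label is spelled "Poster", and the function name and seed are hypothetical. It does not reproduce how the Open Images metadata was actually read.

```python
import random


def balanced_sample(labeled_ids, positive_label="Poster", per_class=5000, seed=0):
    """Pick per_class positives and per_class negatives from (image_id, label) pairs.

    Returns a shuffled list of (image_id, target), with target 1 for posters
    and 0 for everything else.
    """
    rng = random.Random(seed)
    positives = [i for i, lab in labeled_ids if lab == positive_label]
    negatives = [i for i, lab in labeled_ids if lab != positive_label]
    # Cap at what is available so small toy inputs do not raise an error.
    pos = rng.sample(positives, min(per_class, len(positives)))
    neg = rng.sample(negatives, min(per_class, len(negatives)))
    data = [(i, 1) for i in pos] + [(i, 0) for i in neg]
    rng.shuffle(data)
    return data


# Toy usage with made-up ids: 10 posters and 10 non-posters.
toy = [(f"p{i}", "Poster") for i in range(20)] + [(f"o{i}", "Tree") for i in range(100)]
sample = balanced_sample(toy, per_class=10)
```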

We took a transfer learning approach to this problem, using a frozen ResNet-18 backbone and a custom classifier head. This custom head consisted of a 256-unit hidden layer with a ReLU activation function, followed by a 1-unit output layer with a sigmoid activation function. Training was done using binary cross entropy loss and an Adam optimizer. Early stopping kicked in at 18 epochs. Training progression can be seen below.

In terms of performance, the model achieved an accuracy of 94.4% on the validation set and 92.5% on the test set, indicating that it classifies images as posters or non-posters well. A few other models were tried, including a simple CNN built from scratch as well as a CNN with batch normalization and dropout. The simple 3-block CNN achieved a respectable accuracy of 87.2% on the validation set, but the transfer learning model trained just as quickly and achieved a higher accuracy. Leveraging the pre-trained features from ResNet-18 proved effective for this task, even with the backbone frozen to improve training times.
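
For readers unfamiliar with early stopping, the rule can be sketched as below. This is an illustration only: the report does not state which metric was monitored or what patience was used, so monitoring validation loss, the patience of 3, and the class and function names are all assumptions.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return the 1-based epoch at which training stops for a loss history."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)


# Made-up loss curve: best at epoch 3, then three epochs without improvement.
stopped_at = epochs_run([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5], patience=3)
```

In this toy history training stops at epoch 6, so the later drop to 0.5 is never reached. That is the usual trade-off of a short patience.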

Harmful AI Poster Classification

The goal of the second stage of the pipeline is to flag an image for human review if it is a potentially harmful AI-generated poster. “Harmful” in this context is defined as an image that is a poster that contains misspellings or other artifacts that are not typically found in posters, leading to distrust on the part of the viewer and a counter-productive poster from the perspective of the election official.

The dataset for this model was collected differently. Obtaining AI-generated posters proved to be difficult, so we generated them ourselves. We used OpenAI’s DALL-E image generation client to create 472 posters. A set of 10 distinct prompts covering a variety of themes was used to guide the generation. These images formed the basis for our “Harmful AI” class examples, as they all clearly contain artifacts that are not typically found in posters. We then took an equal number of non-AI-generated posters from the previously used dataset to form our “Non-Harmful AI” class examples.

Due to the similarity of the task to the poster/non-poster classification, we used a similar model structure, though this time with a ResNet-50 backbone due to the increased complexity of the task. A custom classifier head was used again, with a 256-unit hidden layer with a ReLU activation function, followed by a 1-unit output layer with a sigmoid activation function. Dropout was added this time as well, with a dropout rate of 0.7. Training was done using binary cross entropy loss and an Adam optimizer. Early stopping kicked in at 15 epochs. Training progression can be seen below.

In terms of performance, the model achieved an accuracy of 89.4% on the validation set and 90.6% on the test set, indicating that it classifies images as harmful AI posters or not reasonably well. This time only one model was trained, as the performance was satisfactory. Again, leveraging the pre-trained features from even the frozen ResNet-50 backbone proved effective for this task.
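
Written out, a head of this shape computes the following. This is a sketch under assumptions the report does not state: that dropout is applied to the hidden activations, that it uses the common inverted-dropout scaling, and that the input $x$ is the 2048-dimensional pooled feature vector of a standard ResNet-50. The symbols $W_1, b_1, w_2, b_2, m$ are notation introduced here, and $y \in \{0, 1\}$ is the true label.

```latex
\begin{aligned}
h &= \mathrm{ReLU}(W_1 x + b_1), & W_1 &\in \mathbb{R}^{256 \times 2048} \\
\tilde{h} &= \frac{m \odot h}{1 - p}, & m_i &\sim \mathrm{Bernoulli}(1 - p),\quad p = 0.7 \\
\hat{y} &= \sigma\!\left(w_2^{\top} \tilde{h} + b_2\right), & w_2 &\in \mathbb{R}^{256} \\
\mathcal{L} &= -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
\end{aligned}
```

With $p = 0.7$, only about $0.3 \times 256 \approx 77$ hidden units are active on any training step. At inference time dropout is switched off, so $\tilde{h} = h$.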

Harmful AI Non-Poster Detection

For AI detection on non-poster-like images, we trained two models for our binary classification task: a fine-tuned ResNet18 and a Vision Transformer. We used two datasets for this part of the pipeline. The first is CIFAKE, which contains 120,000 REAL and FAKE images derived from CIFAR-10. FAKE images were generated using a diffusion model based on CIFAR-10 images. The second dataset came from the 2023 Fake or Real: AI-generated Image Discrimination Competition, consisting of 20,000 images. Both datasets have evenly distributed labels.

For our ResNet18 model, we combined the CIFAKE and competition datasets and split them into training (80%), validation (20%), and testing (20%) sets. We fine-tuned the model by freezing all layers except the final classification head, which we replaced with a single linear layer. The model was trained using the Adam optimizer with a learning rate of 0.001. This model achieved an accuracy of 85% on the validation set and 75% on the test set.

Given this promising result, we decided to push further and train a Vision Transformer (ViT) model. We chose ViT because AI-generated images often contain subtle artifacts, particularly in human faces, that ViTs are better equipped to capture. We fine-tuned the ViT model using only the CIFAKE dataset to better assess its ability to generalize to unseen data. After training for 20,000 steps (approximately 5 epochs of 4,000 steps each), the model achieved 90% accuracy on the CIFAKE test set and 80% accuracy on an unseen test set of competition data. This was a significant improvement over our initial model and demonstrated the increased robustness of our approach.

Object Detection

The object detection algorithm aims to detect very small artifacts in AI-generated images that our detection model cannot catch. To achieve high accuracy on very small objects, we decided to fine-tune a two-stage detector from MMDetection (available on GitHub). Of the models offered by MMDetection, we chose htc_r50_fpn_1x_coco, as it offers the best balance between accuracy and speed for tiny objects like AI artifacts.

This model is a two-stage detector: it first proposes bounding boxes and then classifies them. Although it is designed to do both object detection and segmentation, we turned off the segmentation part, as it is not needed for our task. The model takes an image and resizes it to 800x800, then passes it through a backbone to extract features and a feature pyramid network to build a feature map. This feature map is then passed through a region proposal network to propose bounding boxes and a classifier to classify them.

The goal of this model in the pipeline is to catch the images that the other models missed. It targets images that look more realistic, and that the AI non-poster detection model therefore fails to capture, but that are still harmful because they contain artifacts that make the image look fake or “strange”.

Note: While this model can easily run inference on a CPU, it was very hard to train on one, so we moved training to a GPU. The model appears to be training well, but due to our compute limitations we could not train it for more epochs, so this part is a proof of concept.

Data

Since labeled data on this topic is scarce, we used only the dataset available on Roboflow. However, to have more training data, we moved most of the test images to the training and validation sets, leaving the following quantities:

  • Train images: 69
  • Validation images: 25
  • Test images: 10

Here is an example of what these images look like:

For training, we also applied data augmentation using PhotoMetricDistortion from MMDetection, making only light photometric changes (saturation, brightness, etc.), along with the resizing used in the original model for better detection.
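
In MMDetection's dict-based config style, an augmentation pipeline of this kind might look like the fragment below. This is a hypothetical sketch, not the configuration used in the project: the key names follow the MMDetection 3.x config style and may differ in other releases, and the numeric values are illustrative guesses at "light" settings (PhotoMetricDistortion's library defaults are wider: brightness_delta=32, contrast and saturation ranges of (0.5, 1.5), hue_delta=18).

```python
# Hypothetical training-pipeline fragment in MMDetection's config style.
# The fragment is plain Python dicts, so it can be inspected without
# MMDetection installed.
train_pipeline = [
    dict(type="LoadImageFromFile"),
    # Boxes only: mask loading is off because segmentation is not used.
    dict(type="LoadAnnotations", with_bbox=True, with_mask=False),
    # keep_ratio is an assumption; the text only gives the 800x800 target.
    dict(type="Resize", scale=(800, 800), keep_ratio=True),
    dict(
        type="PhotoMetricDistortion",
        brightness_delta=16,          # illustrative, half the library default
        contrast_range=(0.8, 1.2),    # illustrative, narrower than the default
        saturation_range=(0.8, 1.2),  # illustrative, narrower than the default
        hue_delta=9,                  # illustrative, half the library default
    ),
    dict(type="PackDetInputs"),
]

step_names = [step["type"] for step in train_pipeline]
```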

Results

As mentioned before, the plots below show that the model was learning rapidly. However, due to our compute limitations, we did not train it for more epochs. While it seems to perform well in our pipeline, the confidence threshold for artifacts currently has to be set very low (0.15), which may add false positives. In the future, we would like to train it for more epochs to obtain a more robust model.
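
The effect of the confidence threshold can be illustrated with a small sketch. The detections, scores, and names below are made up for illustration; only the 0.15 threshold comes from this report.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence in [0, 1]


def flag_image(detections, score_thr=0.15):
    """Keep detections at or above score_thr; flag the image if any remain."""
    kept = [d for d in detections if d.score >= score_thr]
    return len(kept) > 0, kept


# Made-up detector output for one image.
dets = [
    Detection((10, 10, 30, 30), 0.22),
    Detection((50, 40, 70, 60), 0.16),
    Detection((5, 80, 20, 95), 0.04),
]
flagged_low, kept_low = flag_image(dets, score_thr=0.15)
flagged_high, kept_high = flag_image(dets, score_thr=0.50)
```

With the low threshold two weak detections survive and the image is flagged; with a conventional 0.5 threshold nothing survives. A low threshold catches more artifacts but also lets more spurious boxes through, which is why flagged images still need human review.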

Future work

While the developed pipeline shows promise, there are several avenues for future work, focusing on the scope and robustness of each component.

Poster Classification

For the poster classification model, the main focus of future work would be to enhance detection of subtle AI artifacts, specifically in text. Improving the training dataset for harmful AI posters through more sophisticated prompt engineering and/or a better image generation model would be a good start. More important, however, would be to use Optical Character Recognition (OCR) alongside the current visual classification model. A significant advancement would be a hybrid detection system that leverages both the visual and text-based features of AI-generated posters. Extracting text regions and analyzing their content and style for AI-induced anomalies (inconsistent kerning, unusual font mixing, etc.), combined with visual artifact detection, would not only improve accuracy but also make it easier to explain why a poster is flagged as potentially harmful.
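
As an illustration of the text-based half of such a hybrid system, the sketch below checks text that is assumed to have already been extracted by an OCR step (not shown) against a small vocabulary. Everything here is hypothetical: the function name, the toy vocabulary, and the 0.7 similarity cut-off. It uses only Python's standard difflib module and is not part of the current pipeline.

```python
import difflib
import re


def suspicious_words(ocr_text, vocabulary, cutoff=0.7):
    """Return (token, closest_match) pairs for tokens not in the vocabulary.

    closest_match is None when no vocabulary word is similar enough.
    """
    vocab = sorted({w.lower() for w in vocabulary})
    known = set(vocab)
    findings = []
    for token in re.findall(r"[A-Za-z']+", ocr_text):
        word = token.lower()
        if word in known:
            continue
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        findings.append((token, match[0] if match else None))
    return findings


# Toy vocabulary and a made-up OCR result from a poster.
vocab = ["register", "to", "vote", "by", "election", "day", "today"]
found = suspicious_words("REGISTR TO VOET BY ELECTION DAY", vocab)
```

A token missing from the vocabulary is not necessarily a misspelling (names and places would also be reported), so output like this would be one signal among several rather than a verdict.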

AI Detection

Regarding the general AI detection model, initial results suggest that the fine-tuned ResNet, while effective on the training set, may lack robustness when extrapolating to entirely unseen image types or generation methods. Future work should therefore explore alternative and potentially more complex model architectures beyond standard CNNs. Investigating and expanding on the capabilities of Vision Transformers for this task would be helpful, given their different approach to capturing global image context, which might be advantageous for detecting diverse AI generation patterns.

Object Detection

The object detection component, aimed at identifying fine-grained artifacts, would benefit significantly from training for more epochs, since the plots presented show that the model had not finished learning, and from a larger and more diverse dataset. Actively seeking or generating data featuring a wider variety of subtle AI artifacts is crucial for improving its generalization. Additionally, exploring image segmentation alongside detection could localize AI artifacts more precisely, potentially improving detection performance. Finally, systematically benchmarking the current fine-tuned model against other state-of-the-art two-stage and one-stage detectors would help ensure the most effective model is used.

Conclusion

Throughout this project we focused on keeping a consistent definition of what counts as a “harmful” image. It was easy to fall back into the assumption that the task was to flag images as either AI generated or not, but the task was more nuanced than that. A poster can be AI generated, but if it is done well, it can be useful for the election official. Reframing the problem as classifying images as trustworthy or not trustworthy helped keep the overall goal in mind. Generative AI is a powerful tool that can ease the burden on the already limited resources available to election officials, but when used poorly it can work against their goals and cause mistrust among the public. Due to the subjective nature of “trust”, this pipeline is intended to be used as a tool for election officials to detect harmful AI images, but the final decision on whether an image is harmful should be made by a human. Ideally, a flagged image would be shown to a variety of potential voters to see how they respond, whether it encourages them to vote or causes distrust of the voting process. Our pipeline showed promising results but, as always, there is room for improvement. We hope that it can be a useful tool for election officials to detect harmful AI images and help them communicate with the public.